Combining mass spectrometry and machine learning to discover bioactive peptides

Peptides play important roles in regulating biological processes and form the basis of many therapeutic drugs. To date, only about 300 peptides in humans have confirmed bioactivity, although tens of thousands have been reported in the literature. The majority of these are inactive degradation products of endogenous proteins and peptides, which creates a needle-in-a-haystack problem: identifying, from large-scale peptidomics experiments, the most promising candidate peptides to test for bioactivity. To address this challenge, we conducted a comprehensive analysis of the mammalian peptidome across seven tissues in four different mouse strains and used the data to train a machine learning model that predicts hundreds of peptide candidates based on patterns in the mass spectrometry data. We provide in silico validation examples and experimental confirmation of bioactivity for two peptides, demonstrating the utility of this resource for discovering lead peptides for further characterization and therapeutic development.


nature research | reporting summary
April 2020

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.
Validation with this paper. The raw mass spectrometry data and processed search files are publicly available at the ProteomeXchange Consortium via the PRIDE partner repository with the data set identifier PXD022225 (https://www.ebi.ac.uk/pride/archive/projects/PXD022225/). Public peptide databases such as SwePep (http://www.swepep.org/), UniProt (https://www.uniprot.org/) and NeuroPep (http://isyslab.info/NeuroPep/) were used for Supplementary Data 3.
The peptidomics study involved n=12 mice in each of 4 different genetic or diet backgrounds (n=48 in total). This number is sufficient to detect differences in the peptidome across genetic strain backgrounds, as shown in the manuscript. In training the PPV model, we aimed to capture as much heterogeneity in the peptidome as possible by sampling n=7 different tissues/organs, giving a total of n=336 samples analyzed. High-scoring PPV predictions were tested in n=3 diabetic mice for acute ability to influence blood glucose (BG) levels in vivo, or in relevant in vitro assays to minimize the number of experimental animals used. Any positive indication was reproduced in at least n=7 animals to increase statistical power. We chose n=3 mice to provide enough power to observe a minimum 20% change in BG; n=7 to 10 was chosen to detect a minimum 10% change in BG, based on historical variation within internal studies in this experimental paradigm.
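The sample totals above follow directly from the design (12 mice per background, 4 backgrounds, 7 tissues per mouse). A minimal Python sketch of that arithmetic is given here; the background and tissue labels are hypothetical placeholders, since this section states only the counts.

```python
from itertools import product

# Hypothetical labels: this section gives only the counts, not the names.
BACKGROUNDS = [f"background_{i}" for i in range(1, 5)]  # 4 genetic or diet backgrounds
MICE_PER_BACKGROUND = 12                                # n=12 mice per background
TISSUES = [f"tissue_{i}" for i in range(1, 8)]          # 7 tissues/organs per mouse


def enumerate_samples():
    """Return one (background, mouse, tissue) tuple per analyzed sample."""
    mice = range(1, MICE_PER_BACKGROUND + 1)
    return list(product(BACKGROUNDS, mice, TISSUES))


samples = enumerate_samples()
# An animal is identified by its background and its number within that background.
n_animals = len({(background, mouse) for background, mouse, _ in samples})
print(n_animals, len(samples))  # 48 336
```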
No animals were excluded from the acquisition of the mass spectrometry data, downstream analysis or PPV training, except one raw file (Diabetic mouse_09 from Sc. Fat), which was truncated during MS acquisition and subsequently removed from downstream analysis. Peptides below 7 amino acids in length or with a Mascot score below 20 were discarded computationally. For hydrodynamic gene delivery, animals that exhibited poor status after tail vein injection were excluded and not used for terminal plasma collection.
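The computational peptide filter described above can be sketched as follows. This is an illustration only, not the pipeline used in the study: the `PeptideHit` record, the function names and the example sequences are hypothetical, and only the two thresholds (length 7, Mascot score 20) come from the text.

```python
from dataclasses import dataclass

MIN_LENGTH = 7          # peptides below 7 amino acids are discarded
MIN_MASCOT_SCORE = 20   # peptides with a Mascot score below 20 are discarded


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one peptide identification."""
    sequence: str
    mascot_score: float


def passes_filters(hit):
    """Keep a hit only if it meets both the length and the score threshold."""
    return len(hit.sequence) >= MIN_LENGTH and hit.mascot_score >= MIN_MASCOT_SCORE


def filter_hits(hits):
    """Return the hits that survive both filters, preserving input order."""
    return [hit for hit in hits if passes_filters(hit)]


# Made-up example hits, for illustration only.
example_hits = [
    PeptideHit("ACDEFGH", 20.0),    # 7 residues, score 20: kept
    PeptideHit("ACDEFG", 55.0),     # 6 residues: discarded
    PeptideHit("ACDEFGHIK", 19.9),  # score below 20: discarded
]
kept = filter_hits(example_hits)
print([hit.sequence for hit in kept])  # ['ACDEFGH']
```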
Peptides were screened in vivo in n=3 db/db mice, and replicated in n=7 or n=10 animals. All in vitro data were replicated in at least two independent biological replicates, each with two technical replicates.
Mass spectrometry data acquisition was randomized for strain background within each tissue to minimize strain batch effects. For blood glucose measurements, mice older than 11 weeks and with blood glucose levels higher than 16 mM were selected and allocated to different treatment groups by randomization based on blood glucose levels. Animals were assigned to groups based on blood glucose so that all groups had equivalent and representative mean and SD blood glucose, not significantly different from one another.
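One common way to obtain groups with comparable blood glucose is blocked randomization: rank the eligible animals by blood glucose, cut the ranking into blocks of one animal per group, and shuffle within each block. The Python sketch below assumes that scheme; the exact allocation procedure is not specified in this section, and the field names and example values are hypothetical. Only the eligibility thresholds (older than 11 weeks, blood glucose above 16 mM) come from the text.

```python
import random
from statistics import mean


def eligible(mouse):
    """Selection criteria from the text: older than 11 weeks, blood glucose above 16 mM."""
    return mouse["age_weeks"] > 11 and mouse["bg_mM"] > 16


def allocate(mice, n_groups, seed=0):
    """Blocked randomization on blood glucose (an assumed scheme, not the study's).

    Eligible mice are ranked by blood glucose and split into consecutive blocks
    of n_groups animals; within each block the animals go to groups in random order.
    """
    rng = random.Random(seed)
    ranked = sorted((m for m in mice if eligible(m)), key=lambda m: m["bg_mM"])
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]
        order = list(range(n_groups))
        rng.shuffle(order)
        # A final short block simply fills a random subset of the groups.
        for mouse, group_index in zip(block, order):
            groups[group_index].append(mouse)
    return groups


# Made-up animals, for illustration only.
mice = [{"id": i, "age_weeks": 12, "bg_mM": 17.0 + i} for i in range(12)]
mice.append({"id": 100, "age_weeks": 10, "bg_mM": 25.0})  # too young: not selected
mice.append({"id": 101, "age_weeks": 13, "bg_mM": 14.0})  # blood glucose too low: not selected

groups = allocate(mice, n_groups=3)
group_means = [mean(m["bg_mM"] for m in group) for group in groups]
print([len(group) for group in groups], group_means)
```

Because each group receives exactly one animal from every block, the group means can differ by no more than the spread within the blocks, whatever the random seed.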
Mass spectrometry data acquisition was not blinded for strain identity. For training the logistic (PPV) and comparative non-linear models, we used nested 5-fold cross-validation to ensure that reported performance metrics are based only on data unseen by the models during training and optimization. The regression coefficients are reported from 20 models. Blood glucose measurements were recorded by a researcher not familiar with the grouping and without a record of the group identities.
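The data partitioning behind nested cross-validation can be illustrated with a short Python sketch. It shows only how indices are split so that each outer test fold stays unseen during training and optimization; it is not the study's training code. The number of inner folds (5), the seed and the example size of 100 items are assumptions, and model fitting is omitted.

```python
import random


def k_fold(indices, k, rng):
    """Shuffle the indices and split them into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_items, k_outer=5, k_inner=5, seed=0):
    """Yield (outer_test, outer_train, inner_splits) for each outer fold.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train, so hyperparameter optimization never touches outer_test.
    """
    rng = random.Random(seed)
    outer_folds = k_fold(range(n_items), k_outer, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = k_fold(outer_train, k_inner, rng)
        inner_splits = []
        for a, inner_val in enumerate(inner_folds):
            inner_train = [idx for b, fold in enumerate(inner_folds) if b != a for idx in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_test, outer_train, inner_splits


# Check on a made-up data set of 100 items that no outer test index leaks
# into any inner training or validation fold.
splits = list(nested_cv_splits(100))
leak_free = all(
    set(outer_test).isdisjoint(inner_train) and set(outer_test).isdisjoint(inner_val)
    for outer_test, _, inner_splits in splits
    for inner_train, inner_val in inner_splits
)
print(len(splits), leak_free)  # 5 True
```

In such a scheme, model selection uses the inner splits and the reported performance is computed on the outer test folds only.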
The mouse monoclonal antibody HUI018 was made at Novo Nordisk against human insulin using classical hybridoma technology. The polyclonal antibody pAB 4077 (fractions E+F) was raised against rat insulin 1+2 in guinea pigs and isolated by chromatography. Antibodies are described in Andersen et al. 1993 (https://pubmed.ncbi.nlm.nih.gov/8472350/).
Plasma insulin levels were measured by an in-house developed Luminescence Oxygen Channeling Immunoassay (LOCI). 5 µg/mL, 35 µL/well of mAb HUI018-conjugated acceptor beads was used for plate coating, followed by incubation with detection antibody at 6 µg/mL,